In [1]:
import sys,os,gzip
from collections import defaultdict
from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
%load_ext sql



In [2]:
sys.path.append(os.path.sep.join(os.path.split(os.getcwd())[:-1]))

In [3]:
import splitter

Our test set here includes the 16 million molecules from the old ZINC clean set that could be successfully processed by the RDKit.

We use the Standard InChI that comes with ChEMBL and a non-standard InChI (options "/FixedH /SUU") that allows tautomers to be distinguished. Here's the sequence of psql commands used to generate that set:


In [4]:
%sql postgresql://localhost/inchi_split \
    select count(*) from zinc_clean_nonstandard;


1 rows affected.
Out[4]:
count
16390000

Big caveat here: I forgot the last commit in my loading script, so the last block of structures is missing.
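The queries that follow group on individual InChI layers (formula, skeleton, hydrogens, charge, and so on), each stored in its own column with its leading "/" kept. As a rough illustration of what such a split looks like, here is a minimal sketch. It is not the `splitter` module imported above: the function name `split_layers` and the `LAYER_NAMES` mapping are hypothetical, chosen to mirror the column names used in the queries, and the isotopic stereo sublayers are deliberately not handled.

```python
# Hypothetical sketch: split an InChI into named layers.
# This is NOT the author's `splitter` module; names mirror the SQL columns.
LAYER_NAMES = {
    "c": "skeleton",
    "h": "hydrogens",
    "q": "charge",
    "p": "protonation",
    "b": "stereo_bond",
    "t": "stereo_tet",
    "m": "stereo_m",
    "s": "stereo_s",
    "i": "isotope",
}


def split_layers(inchi):
    """Return a dict mapping layer names to layer text (leading '/' kept).

    Layers that follow the fixed-H marker ('/f') get a 'fixedh_' prefix.
    Assumes the InChI has a formula layer; stereo sublayers inside the
    isotopic layer would overwrite the main ones and are not handled.
    """
    version, formula, *layers = inchi.split("/")
    result = {"formula": "/" + formula}
    prefix = ""
    for layer in layers:
        if not layer:
            continue  # skip empty segments
        tag = layer[0]
        if tag == "f":
            # Everything after '/f' belongs to the fixed-H part.
            prefix = "fixedh_"
            if len(layer) > 1:
                result["fixedh_formula"] = "/" + layer
            continue
        name = LAYER_NAMES.get(tag, tag)
        result[prefix + name] = "/" + layer
    return result


example = split_layers(
    "InChI=1/C29H33N2/c1-28(2)22-16-12-14-18-24(22)30(5)26(28)20-10-8-7-9-11-21-27-"
    "29(3,4)23-17-13-15-19-25(23)31(27)6/h7-21H,1-6H3/q+1"
)
for name, text in example.items():
    print(name, text)
```

Keeping the "/" prefix on each piece means the pieces can be concatenated back together in order, which is also how the values appear in the query results.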

Formula level grouping


In [5]:
d = %sql \
    select formula,count(zinc_id) freq from zinc_clean_nonstandard group by formula \
    order by freq desc limit 10;
d


10 rows affected.
Out[5]:
formula freq
/C17H26N2O2 13585
/C21H27N3O3 12718
/C16H24N2O2 12549
/C19H29N3O3 12371
/C19H26N4O2 12248
/C21H26N2O3 12228
/C19H27N3O3 11905
/C19H29N3O2 11891
/C20H24N2O3 11831
/C19H25N3O3 11609

Grouping on the main layer


In [6]:
d = %sql \
    select formula,skeleton,hydrogens,count(zinc_id) freq from zinc_clean_nonstandard group by \
    (formula,skeleton,hydrogens) \
    order by freq desc limit 10;


10 rows affected.

Look at a few of the common main layer groups


In [7]:
d[:5]


Out[7]:
[('/C20H34O5', '/c1-2-3-6-9-15(21)12-13-17-16(18(22)14-19(17)23)10-7-4-5-8-11-20(24)25', '/h4,7,12-13,15-19,21-23H,2-3,5-6,8-11,14H2,1H3,(H,24,25)', 36),
 ('/C28H44O', '/c1-19(2)20(3)9-10-22(5)26-15-16-27-23(8-7-17-28(26,27)6)12-13-24-18-25(29)14-11-21(24)4', '/h9-10,12-13,19-20,22,25-27,29H,4,7-8,11,14-18H2,1-3,5-6H3', 32),
 ('/C15H19NO6', '/c1-8(17)16-11-12(18)13-10(21-14(11)19)7-20-15(22-13)9-5-3-2-4-6-9', '/h2-6,10-15,18-19H,7H2,1H3,(H,16,17)', 31),
 ('/C12H20O6', '/c1-11(2)14-5-6(16-11)8-7(13)9-10(15-8)18-12(3,4)17-9', '/h6-10,13H,5H2,1-4H3', 30),
 ('/C8H15NO6', '/c1-3(11)9-5-7(13)6(12)4(2-10)15-8(5)14', '/h4-8,10,12-14H,2H2,1H3,(H,9,11)', 29)]

In [13]:
tpl=d[0][:-1]
print(tpl)
rows = %sql \
  select zinc_id,smiles from zinc_clean join zinc_clean_nonstandard using (zinc_id) where \
    (formula,skeleton,hydrogens) = :tpl 
cids = [x for x,y in rows][:9]
ms = [Chem.MolFromSmiles(y) for x,y in rows][:9]
Draw.MolsToGridImage(ms,legends=cids)


('/C20H34O5', '/c1-2-3-6-9-15(21)12-13-17-16(18(22)14-19(17)23)10-7-4-5-8-11-20(24)25', '/h4,7,12-13,15-19,21-23H,2-3,5-6,8-11,14H2,1H3,(H,24,25)')
36 rows affected.
Out[13]:

In [14]:
tpl=d[1][:-1]
print(tpl)
rows = %sql \
  select zinc_id,smiles from zinc_clean join zinc_clean_nonstandard using (zinc_id) where \
    (formula,skeleton,hydrogens) = :tpl
cids = [x for x,y in rows][:9]
ms = [Chem.MolFromSmiles(y) for x,y in rows][:9]
Draw.MolsToGridImage(ms,legends=cids)


('/C28H44O', '/c1-19(2)20(3)9-10-22(5)26-15-16-27-23(8-7-17-28(26,27)6)12-13-24-18-25(29)14-11-21(24)4', '/h9-10,12-13,19-20,22,25-27,29H,4,7-8,11,14-18H2,1-3,5-6H3')
32 rows affected.
Out[14]:

In [15]:
tpl=d[4][:-1]
print(tpl)
rows = %sql \
  select zinc_id,smiles from zinc_clean join zinc_clean_nonstandard using (zinc_id) where \
    (formula,skeleton,hydrogens) = :tpl
cids = [x for x,y in rows][:9]
ms = [Chem.MolFromSmiles(y) for x,y in rows][:9]
Draw.MolsToGridImage(ms,legends=cids)


('/C8H15NO6', '/c1-3(11)9-5-7(13)6(12)4(2-10)15-8(5)14', '/h4-8,10,12-14H,2H2,1H3,(H,9,11)')
29 rows affected.
Out[15]:

Charges


In [16]:
d = %sql \
    select formula,skeleton,hydrogens,charge,protonation,count(zinc_id) freq from zinc_clean_nonstandard group by \
    (formula,skeleton,hydrogens,charge,protonation) \
    order by freq desc limit 10;
d[:5]


10 rows affected.
Out[16]:
[('/C20H34O5', '/c1-2-3-6-9-15(21)12-13-17-16(18(22)14-19(17)23)10-7-4-5-8-11-20(24)25', '/h4,7,12-13,15-19,21-23H,2-3,5-6,8-11,14H2,1H3,(H,24,25)', None, '/p-1', 36),
 ('/C28H44O', '/c1-19(2)20(3)9-10-22(5)26-15-16-27-23(8-7-17-28(26,27)6)12-13-24-18-25(29)14-11-21(24)4', '/h9-10,12-13,19-20,22,25-27,29H,4,7-8,11,14-18H2,1-3,5-6H3', None, None, 32),
 ('/C15H19NO6', '/c1-8(17)16-11-12(18)13-10(21-14(11)19)7-20-15(22-13)9-5-3-2-4-6-9', '/h2-6,10-15,18-19H,7H2,1H3,(H,16,17)', None, None, 31),
 ('/C12H20O6', '/c1-11(2)14-5-6(16-11)8-7(13)9-10(15-8)18-12(3,4)17-9', '/h6-10,13H,5H2,1-4H3', None, None, 30),
 ('/C8H15NO6', '/c1-3(11)9-5-7(13)6(12)4(2-10)15-8(5)14', '/h4-8,10,12-14H,2H2,1H3,(H,9,11)', None, None, 29)]

In [21]:
tpl=d[0][:-1]
tpl = tuple(x if x is not None else '' for x in tpl)
print(tpl)
rows = %sql \
  select zinc_id,smiles from zinc_clean join zinc_clean_nonstandard using (zinc_id) where \
    (formula,skeleton,hydrogens,coalesce(charge,''),coalesce(protonation,'')) = :tpl
cids = [x for x,y in rows][:9]
ms = [Chem.MolFromSmiles(y) for x,y in rows][:9]
Draw.MolsToGridImage(ms,legends=cids)


('/C20H34O5', '/c1-2-3-6-9-15(21)12-13-17-16(18(22)14-19(17)23)10-7-4-5-8-11-20(24)25', '/h4,7,12-13,15-19,21-23H,2-3,5-6,8-11,14H2,1H3,(H,24,25)', '', '/p-1')
36 rows affected.
Out[21]:
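The `coalesce(...,'')` calls and the `None` to `''` mapping in the cell above are there because an SQL comparison against NULL never evaluates to true, so a key containing NULLs would match no rows. A minimal sketch of the effect, using an in-memory SQLite database from the standard library rather than the PostgreSQL database used in this notebook (the `layers` table and its three rows are made up for illustration):

```python
import sqlite3

# Hypothetical toy table; only the NULL behaviour is the point here.
conn = sqlite3.connect(":memory:")
conn.execute("create table layers (zinc_id text, charge text, protonation text)")
conn.executemany(
    "insert into layers values (?, ?, ?)",
    [("A", None, "/p-1"), ("B", None, None), ("C", "/q+1", None)],
)

# Naive lookup: `charge = NULL` is never true, so row A is not found.
naive = conn.execute(
    "select zinc_id from layers where charge = ? and protonation = ?",
    (None, "/p-1"),
).fetchall()

# Map None to '' on the Python side and coalesce NULL to '' on the SQL side.
key = tuple(x if x is not None else "" for x in (None, "/p-1"))
fixed = conn.execute(
    "select zinc_id from layers "
    "where coalesce(charge, '') = ? and coalesce(protonation, '') = ?",
    key,
).fetchall()

print("naive:", naive)
print("with coalesce:", fixed)
```

This relies on the empty string never being a real layer value, which holds here because every stored layer starts with "/".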

Stereo grouping


In [26]:
d = %sql \
    select formula,skeleton,hydrogens,charge,protonation,stereo_bond,stereo_tet,stereo_m,stereo_s,count(zinc_id) freq \
    from zinc_clean_nonstandard where stereo_bond is not null or stereo_tet is not null \
    group by \
    (formula,skeleton,hydrogens,charge,protonation,stereo_bond,stereo_tet,stereo_m,stereo_s) \
    order by freq desc limit 10;
d[:5]


10 rows affected.
Out[26]:
[('/C18H18Cl2N2O2S2', '/c1-18(2)8-12-14(13(23)9-18)16(25-15(12)20)26(3)22-17(24)21-11-6-4-10(19)5-7-11', '/h4-7H,8-9H2,1-3H3,(H,21,24)', None, None, None, '/t26?', None, None, 4),
 ('/C28H32N2O4', '/c1-17(18-9-7-6-8-10-18)30-23-14-12-20-21(16-24(23)31)22(29-2)13-11-19-15-25(32-3)27(33-4)28(34-5)26(19)20', '/h6-10,12,14-17,22,29H,11,13H2,1-5H3,(H,30,31)', None, '/p+1', None, '/t17-,22+', '/m1', '/s1', 3),
 ('/C12H16Br2O2', '/c13-11-2-8-1-10(5-11,4-9(15)16)6-12(14,3-8)7-11', '/h8H,1-7H2,(H,15,16)', None, '/p-1', None, '/t8?,10?,11-,12+', None, None, 3),
 ('/C21H16Cl2N2O3S', '/c1-29(25-21(27)24-17-8-6-16(23)7-9-17)19-12-10-18(11-13-19)28-20(26)14-2-4-15(22)5-3-14', '/h2-13H,1H3,(H,24,27)', None, None, None, '/t29?', None, None, 3),
 ('/C26H30N4O6', '/c1-15(2)13-20(25(34)35)28-22(31)16(3)27-23(32)21(14-17-9-5-4-6-10-17)30-24(33)18-11-7-8-12-19(18)29-26(30)36', '/h4-12,15-16,20-21H,13-14H2,1-3H3,(H,27,32)(H,28,31)(H,29,36)(H,34,35)', None, '/p-1', None, '/t16-,20-,21-', '/m0', '/s1', 3)]

In [27]:
tpl=d[0][:-1]
tpl = tuple(x if x is not None else '' for x in tpl)
print(tpl)
rows = %sql \
  select zinc_id,smiles from zinc_clean join zinc_clean_nonstandard using (zinc_id) where \
    (formula,skeleton,hydrogens,\
     coalesce(charge,''),coalesce(protonation,''),coalesce(stereo_bond,''),\
     coalesce(stereo_tet,''),coalesce(stereo_m,''),coalesce(stereo_s,'')) = :tpl   
cids = [x for x,y in rows]
ms = [Chem.MolFromSmiles(y) for x,y in rows]
Draw.MolsToGridImage(ms,legends=cids)


('/C18H18Cl2N2O2S2', '/c1-18(2)8-12-14(13(23)9-18)16(25-15(12)20)26(3)22-17(24)21-11-6-4-10(19)5-7-11', '/h4-7H,8-9H2,1-3H3,(H,21,24)', '', '', '', '/t26?', '', '')
4 rows affected.
Out[27]:

In [28]:
tpl=d[1][:-1]
tpl = tuple(x if x is not None else '' for x in tpl)
print(tpl)
rows = %sql \
  select zinc_id,smiles from zinc_clean join zinc_clean_nonstandard using (zinc_id) where \
    (formula,skeleton,hydrogens,\
     coalesce(charge,''),coalesce(protonation,''),coalesce(stereo_bond,''),\
     coalesce(stereo_tet,''),coalesce(stereo_m,''),coalesce(stereo_s,'')) = :tpl   
cids = [x for x,y in rows]
ms = [Chem.MolFromSmiles(y) for x,y in rows]
Draw.MolsToGridImage(ms,legends=cids)


('/C28H32N2O4', '/c1-17(18-9-7-6-8-10-18)30-23-14-12-20-21(16-24(23)31)22(29-2)13-11-19-15-25(32-3)27(33-4)28(34-5)26(19)20', '/h6-10,12,14-17,22,29H,11,13H2,1-5H3,(H,30,31)', '', '/p+1', '', '/t17-,22+', '/m1', '/s1')
3 rows affected.
Out[28]:

An aside

I discovered this little InChI pathology while doing this work. I spent a good half hour trying to track down the RDKit bug that made it happen before realizing that it's by design. Note that this is with FixedH InChIs.


In [30]:
td = %sql \
select t2.zinc_id,t2.nonstandard_inchi,t2.smiles from zinc_clean_nonstandard t1 join zinc_clean t2 using (zinc_id) \
where (formula,skeleton,hydrogens,charge)=\
('/C29H33N2','/c1-28(2)22-16-12-14-18-24(22)30(5)26(28)20-10-8-7-9-11-21-27-29(3,4)23-17-13-15-19-25(23)31(27)6',\
 '/h7-21H,1-6H3','/q+1')
print(td)
cids = [x for x,y,z in td]
ms = [Chem.MolFromSmiles(z) for x,y,z in td]
Draw.MolsToGridImage(ms,legends=cids)


6 rows affected.
+--------------+------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------+
|   zinc_id    |                                                         nonstandard_inchi                                                          |                             smiles                             |
+--------------+------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------+
| ZINC04701292 | InChI=1/C29H33N2/c1-28(2)22-16-12-14-18-24(22)30(5)26(28)20-10-8-7-9-11-21-27-29(3,4)23-17-13-15-19-25(23)31(27)6/h7-21H,1-6H3/q+1 | CN1/C(=C\C=C/C=C\C=C/C2=[N+](C)c3ccccc3C2(C)C)C(C)(C)c2ccccc21 |
| ZINC12405219 | InChI=1/C29H33N2/c1-28(2)22-16-12-14-18-24(22)30(5)26(28)20-10-8-7-9-11-21-27-29(3,4)23-17-13-15-19-25(23)31(27)6/h7-21H,1-6H3/q+1 | CN1/C(=C/C=C/C=C/C=C/C2=[N+](C)c3ccccc3C2(C)C)C(C)(C)c2ccccc21 |
| ZINC19940218 | InChI=1/C29H33N2/c1-28(2)22-16-12-14-18-24(22)30(5)26(28)20-10-8-7-9-11-21-27-29(3,4)23-17-13-15-19-25(23)31(27)6/h7-21H,1-6H3/q+1 | CN1/C(=C\C=C\C=C\C=C\C2=[N+](C)c3ccccc3C2(C)C)C(C)(C)c2ccccc21 |
| ZINC35614446 | InChI=1/C29H33N2/c1-28(2)22-16-12-14-18-24(22)30(5)26(28)20-10-8-7-9-11-21-27-29(3,4)23-17-13-15-19-25(23)31(27)6/h7-21H,1-6H3/q+1 | CN1/C(=C\C=C/C=C\C=C\C2=[N+](C)c3ccccc3C2(C)C)C(C)(C)c2ccccc21 |
| ZINC35614448 | InChI=1/C29H33N2/c1-28(2)22-16-12-14-18-24(22)30(5)26(28)20-10-8-7-9-11-21-27-29(3,4)23-17-13-15-19-25(23)31(27)6/h7-21H,1-6H3/q+1 | CN1/C(=C\C=C/C=C/C=C\C2=[N+](C)c3ccccc3C2(C)C)C(C)(C)c2ccccc21 |
| ZINC35614450 | InChI=1/C29H33N2/c1-28(2)22-16-12-14-18-24(22)30(5)26(28)20-10-8-7-9-11-21-27-29(3,4)23-17-13-15-19-25(23)31(27)6/h7-21H,1-6H3/q+1 | CN1/C(=C\C=C/C=C/C=C/C2=[N+](C)c3ccccc3C2(C)C)C(C)(C)c2ccccc21 |
+--------------+------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------+
Out[30]:

All six of these double-bond stereoisomers get exactly the same InChI: there is no /b layer at all. Sucks to be you if it's important to you that those molecules be different and you're using InChI. Note that at least ZINC12405219 and ZINC19940218 are, according to ZINC, separately available from vendors.
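A quick way to hunt for this kind of collision in general is to group records by InChI and keep the groups that contain more than one distinct structure. Here is a minimal sketch using the six rows from the table above. The helper name `find_collisions` is hypothetical, and comparing SMILES as plain strings is a simplification: in practice you would want to compare canonical SMILES so that two spellings of the same molecule are not counted twice.

```python
from collections import defaultdict

# The shared InChI and the (zinc_id, smiles) pairs are copied from the table above.
INCHI = (
    "InChI=1/C29H33N2/c1-28(2)22-16-12-14-18-24(22)30(5)26(28)20-10-8-7-9-11-21-27-"
    "29(3,4)23-17-13-15-19-25(23)31(27)6/h7-21H,1-6H3/q+1"
)
ROWS = [
    ("ZINC04701292", r"CN1/C(=C\C=C/C=C\C=C/C2=[N+](C)c3ccccc3C2(C)C)C(C)(C)c2ccccc21"),
    ("ZINC12405219", r"CN1/C(=C/C=C/C=C/C=C/C2=[N+](C)c3ccccc3C2(C)C)C(C)(C)c2ccccc21"),
    ("ZINC19940218", r"CN1/C(=C\C=C\C=C\C=C\C2=[N+](C)c3ccccc3C2(C)C)C(C)(C)c2ccccc21"),
    ("ZINC35614446", r"CN1/C(=C\C=C/C=C\C=C\C2=[N+](C)c3ccccc3C2(C)C)C(C)(C)c2ccccc21"),
    ("ZINC35614448", r"CN1/C(=C\C=C/C=C/C=C\C2=[N+](C)c3ccccc3C2(C)C)C(C)(C)c2ccccc21"),
    ("ZINC35614450", r"CN1/C(=C\C=C/C=C/C=C/C2=[N+](C)c3ccccc3C2(C)C)C(C)(C)c2ccccc21"),
]


def find_collisions(records):
    """Group (id, smiles, inchi) records by InChI.

    Returns only the InChIs that map to more than one distinct SMILES string.
    """
    by_inchi = defaultdict(set)
    for _zinc_id, smiles, inchi in records:
        by_inchi[inchi].add(smiles)
    return {inchi: smis for inchi, smis in by_inchi.items() if len(smis) > 1}


collisions = find_collisions([(zid, smi, INCHI) for zid, smi in ROWS])
for inchi, smiles_set in collisions.items():
    print(len(smiles_set), "distinct SMILES share", inchi)
```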

Isotopes


In [42]:
%sql \
select count(*) \
    from zinc_clean_nonstandard where isotope is not null


1 rows affected.
Out[42]:
count
0

No need to group on isotopes here: this set has no labelled compounds. That's likely a property of how the ZINC clean set was constructed.

Examples where tautomerism leads to new tetrahedral stereochemistry


In [57]:
rows = %sql \
  select zinc_id,smiles,nonstandard_inchi from zinc_clean join zinc_clean_nonstandard using (zinc_id) where \
     fixedh_stereo_tet is not null and position('?' in fixedh_stereo_tet)<=0 and stereo_tet!=fixedh_stereo_tet 
len(rows)


161 rows affected.
Out[57]:
161

In [58]:
cids = [x for x,y,z in rows][:10]
ms = [Chem.MolFromSmiles(y) for x,y,z in rows][:10]
Draw.MolsToGridImage(ms,legends=cids)


Out[58]:

Not much interesting there. There's no simple query to find questionable tautomer motion. :-)

Examples where tautomerism leads to new bond stereochemistry


In [68]:
rows = %sql \
  select zinc_id,smiles,nonstandard_inchi from zinc_clean join zinc_clean_nonstandard using (zinc_id) where \
     fixedh_stereo_bond is not null and fixedh_stereo_bond!='/b' and position('?' in fixedh_stereo_bond)<=0 and stereo_bond!=fixedh_stereo_bond 
len(rows)


167 rows affected.
Out[68]:
167

In [69]:
cids = [x for x,y,z in rows][:10]
ms = [Chem.MolFromSmiles(y) for x,y,z in rows][:10]
Draw.MolsToGridImage(ms,legends=cids)


Out[69]:

Not much interesting in those first results.

